Information Theory-Guided Heuristic Progressive Multi-View Coding
Multi-view representation learning aims to capture comprehensive information
from multiple views of a shared context. Recent works intuitively apply
contrastive learning to different views in a pairwise manner, but this
approach suffers from several defects: view-specific noise is not filtered
out when learning view-shared representations; fake negative pairs, in which
the negative term actually belongs to the same class as the positive, are
treated the same as real negative pairs; and measuring the similarities
between all terms evenly may interfere with optimization. Importantly, few
works study the theoretical
framework of generalized self-supervised multi-view learning, especially for
more than two views. To this end, we rethink the existing multi-view learning
paradigm from the perspective of information theory and then propose a novel
information theoretical framework for generalized multi-view learning. Guided
by it, we build a multi-view coding method with a three-tier progressive
architecture, namely Information theory-guided heuristic Progressive
Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the
distribution between views to reduce view-specific noise. In the set-tier, IPMC
constructs self-adjusted contrasting pools, which are adaptively modified by a
view filter. Lastly, in the instance-tier, we adopt a designed unified loss to
learn representations and reduce the gradient interference. Theoretically and
empirically, we demonstrate the superiority of IPMC over state-of-the-art
methods.
Comment: This paper was accepted by the journal Neural Networks (Elsevier)
in 2023. A revised manuscript of arXiv:2109.0234
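
As a rough illustration of what the distribution-tier alignment could look like, the minimal sketch below penalizes the discrepancy between two views' embedding distributions with a generic RBF-kernel MMD loss; the choice of MMD and all names here are our own assumptions, not IPMC's actual objective.

```python
# Illustrative only: a generic distribution-alignment loss between two views,
# standing in for IPMC's distribution-tier; the paper's objective differs.
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Maximum mean discrepancy between two batches of view embeddings."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))  # RBF kernel
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Usage: penalize distribution mismatch between two view-specific encoders.
z1 = torch.randn(128, 64)  # embeddings of view 1 (batch 128, dim 64)
z2 = torch.randn(128, 64)  # embeddings of view 2
alignment_loss = rbf_mmd(z1, z2)
```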
Rethinking skip connection model as a learnable Markov chain
In the years since the birth of ResNet, the skip connection has become the
de facto standard in the design of modern architectures due to its widespread
adoption, easy optimization, and proven performance. Prior work has explained
the effectiveness of the skip connection mechanism from different
perspectives. In this work, we take a deep dive into the behavior of models
with skip connections, which can be formulated as a learnable Markov chain.
An efficient
Markov chain is preferred as it always maps the input data to the target domain
in a better way. However, even when a model is interpreted as a Markov chain,
existing SGD-based optimizers, which are prone to getting trapped in local
optima, do not guarantee that it is optimized toward an efficient one. To
steer optimization toward a more efficient Markov chain, we propose a simple
routine, the penal connection, which turns any residual-like model into a
learnable Markov
chain. Aside from that, the penal connection can also be viewed as a particular
model regularization and can be easily implemented with one line of code in the
most popular deep learning frameworks (source code:
https://github.com/densechen/penal-connection). The encouraging
experimental results in multi-modal translation and image recognition
empirically confirm our conjecture of the learnable Markov chain view and
demonstrate the superiority of the proposed penal connection.
Comment: 12 pages, 4 figures
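
The abstract notes that the penal connection reduces to one line of code; its exact form is given in the paper. Purely as a hypothetical reading, the sketch below accumulates a simple penalty on each residual branch, so that every step of the chain x_{l+1} = x_l + F_l(x_l) is regularized; the penalty form here is an assumption, not the paper's definition.

```python
# Hypothetical sketch: a residual block viewed as one Markov-chain transition,
# with an assumed "one-line" penalty on the residual branch added per block.
import torch
import torch.nn as nn

class PenalizedResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.penalty = torch.tensor(0.0)  # refreshed on every forward pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self.f(x)                   # residual branch F_l(x_l)
        self.penalty = r.pow(2).mean()  # the assumed one-line regularizer
        return x + r                    # skip connection: x_{l+1} = x_l + F_l(x_l)

# Training would add lam * sum(block.penalty for block in blocks) to the loss.
```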
Intriguing Property and Counterfactual Explanation of GAN for Remote Sensing Image Generation
Generative adversarial networks (GANs) have achieved remarkable progress in
the natural image field. However, when applying GANs in the remote sensing (RS)
image generation task, an extraordinary phenomenon is observed: the GAN model
is more sensitive to the size of training data for RS image generation than for
natural image generation. In other words, the generation quality of RS images
will change significantly with the number of training categories or samples per
category. In this paper, we first analyze this phenomenon through two kinds of
toy experiments and conclude that the amount of feature information contained in
the GAN model decreases with reduced training data. Then we establish a
structural causal model (SCM) of the data generation process and interpret the
generated data as the counterfactuals. Based on this SCM, we theoretically
prove that the quality of generated images is positively correlated with the
amount of feature information. This provides insights for enriching the feature
information learned by the GAN model during training. Consequently, we propose
two innovative adjustment schemes, namely Uniformity Regularization (UR) and
Entropy Regularization (ER), to increase the information learned by the GAN
model at the distributional and sample levels, respectively. We theoretically
and empirically demonstrate the effectiveness and versatility of our methods.
Extensive experiments on three RS datasets and two natural datasets show that
our methods outperform the well-established models on RS image generation
tasks. The source code is available at https://github.com/rootSue/Causal-RSGAN
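
UR and ER are defined in the paper; as a loose stand-in for a distribution-level term, the sketch below uses the well-known uniformity loss of Wang and Isola (2020), which spreads normalized features uniformly over the unit hypersphere.

```python
# Not the paper's UR/ER: the Wang & Isola (2020) uniformity loss, used here
# as an approachable analogue of a distribution-level information term.
import torch
import torch.nn.functional as F

def uniformity_loss(features: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    z = F.normalize(features, dim=1)      # project features onto the hypersphere
    d2 = torch.pdist(z).pow(2)            # pairwise squared distances
    return d2.mul(-t).exp().mean().log()  # log-mean-exp; lower = more uniform

feats = torch.randn(256, 128)             # e.g. features for a generated batch
loss_ur = uniformity_loss(feats)          # add to the GAN training objective
```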
A Unified GAN Framework Regarding Manifold Alignment for Remote Sensing Images Generation
Generative Adversarial Networks (GANs) and their variants have achieved
remarkable success on natural images. However, their performance degrades when
applied to remote sensing (RS) images, and the discriminator often suffers from
the overfitting problem. In this paper, we examine the differences between
natural and RS images and find that the intrinsic dimensions of RS images are
much lower than those of natural images. As the discriminator is more
susceptible to overfitting on data with lower intrinsic dimension, it focuses
excessively on local characteristics of RS training data and disregards the
overall structure of the distribution, leading to a faulty generation model. In
response, we propose a novel approach that leverages the real data manifold to
constrain the discriminator and enhance the model performance. Specifically, we
introduce a learnable information-theoretic measure to capture the real data
manifold. Building upon this measure, we propose manifold alignment
regularization, which mitigates the discriminator's overfitting and improves
the quality of generated samples. Moreover, we establish a unified GAN
framework for manifold alignment, applicable to both supervised and
unsupervised RS image generation tasks.
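
The argument hinges on RS images having a lower intrinsic dimension than natural images. The paper's measurement procedure is not reproduced here; as one standard way to estimate intrinsic dimension, the sketch below uses the Levina-Bickel maximum-likelihood estimator based on k-nearest-neighbor distance ratios.

```python
# Illustrative intrinsic-dimension estimate (Levina-Bickel MLE), not
# necessarily the estimator used in the paper.
import torch

def intrinsic_dimension_mle(x: torch.Tensor, k: int = 20) -> float:
    d = torch.cdist(x, x)                          # pairwise distances
    knn, _ = d.topk(k + 1, largest=False)          # k+1 smallest, incl. self
    knn = knn[:, 1:]                               # drop the zero self-distance
    ratios = torch.log(knn[:, -1:] / knn[:, :-1])  # log(T_k / T_j) for j < k
    inv_dim = ratios.mean(dim=1)                   # per-point inverse estimate
    return float(1.0 / inv_dim.mean())

x = torch.randn(1000, 3) @ torch.randn(3, 64)      # data on a ~3-dim subspace
print(intrinsic_dimension_mle(x))                  # estimate should be near 3
```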
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization
Zero-shot skeleton-based action recognition aims to recognize actions of
unseen categories after training on data of seen categories. The key is to
build the connection between visual and semantic space from seen to unseen
classes. Previous studies have primarily focused on encoding sequences into a
single feature vector and subsequently mapping the features to an identical
anchor point within the embedding space. Their performance is hindered by 1)
ignoring global visual/semantic distribution alignment, which limits their
ability to capture the true interdependence between the two spaces, and 2)
neglecting temporal information, since frame-wise features with rich action
clues are directly pooled into a single feature vector. We propose a new
zero-shot skeleton-based action recognition method via mutual information (MI)
estimation and maximization. Specifically, 1) we maximize the MI between visual
and semantic space for distribution alignment; 2) we leverage the temporal
information for estimating the MI by encouraging MI to increase as more frames
are observed. Extensive experiments on three large-scale skeleton action
datasets confirm the effectiveness of our method. Code:
https://github.com/YujieOuO/SMIE
Comment: Accepted by ACM MM 202
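
A common way to estimate and maximize MI between two embedding spaces is the InfoNCE lower bound; the sketch below applies it to visual and semantic features. The paper's concrete estimator and its frame-incremental variant may differ, so treat this as a generic stand-in.

```python
# Generic InfoNCE lower bound on MI between visual and semantic embeddings;
# a stand-in for the paper's MI estimator, not its exact formulation.
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(v: torch.Tensor, s: torch.Tensor, tau: float = 0.1):
    v, s = F.normalize(v, dim=1), F.normalize(s, dim=1)
    logits = v @ s.t() / tau                 # similarity of every (visual, semantic) pair
    labels = torch.arange(v.size(0))         # matched pairs sit on the diagonal
    return -F.cross_entropy(logits, labels)  # maximizing this tightens an MI bound

visual = torch.randn(32, 256)    # pooled skeleton-sequence features
semantic = torch.randn(32, 256)  # class-description text embeddings
loss = -infonce_mi_lower_bound(visual, semantic)  # minimize the negative bound
```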
Spatio-Temporal Branching for Motion Prediction using Motion Increments
Human motion prediction (HMP) has emerged as a popular research topic due to
its diverse applications, but it remains a challenging task due to the
stochastic and aperiodic nature of future poses. Traditional methods rely on
hand-crafted features and machine learning techniques, which often struggle to
model the complex dynamics of human motion. Recent deep learning-based methods
have achieved success by learning spatio-temporal representations of motion,
but these models often overlook the reliability of motion data. Additionally,
the temporal and spatial dependencies of skeleton nodes are distinct. The
temporal relationship captures motion information over time, while the spatial
relationship describes body structure and the relationships between different
nodes. In this paper, we propose a novel spatio-temporal branching network
using incremental information for HMP, which decouples the learning of
temporal-domain and spatial-domain features, extracts more motion information,
and achieves complementary cross-domain knowledge learning through knowledge
distillation. Our approach effectively reduces noise interference and provides
more expressive information for characterizing motion by separately extracting
temporal and spatial features. We evaluate our approach on standard HMP
benchmarks and outperform state-of-the-art methods in terms of prediction
accuracy.
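
A minimal sketch of the motion-increment input and the decoupled branches follows, with placeholder layers: frame-to-frame pose differences feed one branch that mixes over time and one that mixes over joints. The branch architectures and the distillation term below are assumptions, not the paper's design.

```python
# Illustrative only: motion increments plus decoupled temporal/spatial
# branches; layer choices are placeholders, not the paper's network.
import torch

poses = torch.randn(8, 50, 22, 3)          # (batch, frames, joints, xyz)
increments = poses[:, 1:] - poses[:, :-1]  # per-frame motion increments

# Two decoupled stand-in branches: one mixes over time, one over joints.
temporal_branch = torch.nn.Conv1d(22 * 3, 22 * 3, kernel_size=3, padding=1)
spatial_branch = torch.nn.Linear(22 * 3, 22 * 3)

x = increments.flatten(2)                  # (batch, frames-1, joints*3)
t_feat = temporal_branch(x.transpose(1, 2)).transpose(1, 2)  # temporal features
s_feat = spatial_branch(x)                                   # spatial features
# Cross-domain knowledge distillation could then align the two feature sets,
# e.g. distill_loss = (t_feat - s_feat).pow(2).mean()
```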
Unbiased Image Synthesis via Manifold-Driven Sampling in Diffusion Models
Diffusion models are a potent class of generative models capable of producing
high-quality images. However, they can face challenges related to data bias,
favoring specific modes of data, especially when the training data does not
accurately represent the true data distribution and exhibits skewed or
imbalanced patterns. For instance, the CelebA dataset contains more female
images than male images, leading to biased generation results and impacting
downstream applications. To address this issue, we propose a novel method that
leverages manifold guidance to mitigate data bias in diffusion models. Our key
idea is to estimate the manifold of the training data using an unsupervised
approach, and then use it to guide the sampling process of diffusion models.
This encourages the generated images to be uniformly distributed on the data
manifold without altering the model architecture or necessitating labels or
retraining. Theoretical analysis and empirical evidence demonstrate the
effectiveness of our method in improving the quality and unbiasedness of image
generation compared to standard diffusion models.
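
One way to realize manifold guidance without retraining is a guidance-style correction at each reverse-diffusion step. In the hedged sketch below, manifold_score and denoise_step are hypothetical callables standing in for the paper's unsupervised manifold estimate and the base sampler.

```python
# Hypothetical guidance-style step: nudge each reverse-diffusion update along
# the gradient of a manifold objective. `manifold_score` and `denoise_step`
# are assumed callables, not the paper's implementation.
import torch

def guided_step(x_t, denoise_step, manifold_score, guidance_scale=0.1):
    x_t = x_t.detach().requires_grad_(True)
    score = manifold_score(x_t).sum()          # scalar manifold objective
    grad = torch.autograd.grad(score, x_t)[0]  # direction toward uniform coverage
    x_prev = denoise_step(x_t.detach())        # ordinary diffusion update
    return x_prev + guidance_scale * grad      # manifold-guided correction
```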
MetaMask: Revisiting Dimensional Confounder for Self-Supervised Learning
As a successful approach to self-supervised learning, contrastive learning
aims to learn invariant information shared among distortions of the input
sample. While contrastive learning has yielded continuous advancements in
sampling strategy and architecture design, it still suffers from two persistent
defects: the interference of task-irrelevant information and sample
inefficiency, both of which are related to the recurring existence of trivial
constant solutions. From the perspective of dimensional analysis, we find that
the
dimensional redundancy and dimensional confounder are the intrinsic issues
behind the phenomena, and provide experimental evidence to support our
viewpoint. We further propose a simple yet effective approach MetaMask, short
for the dimensional Mask learned by Meta-learning, to learn representations
against dimensional redundancy and confounder. MetaMask adopts the
redundancy-reduction technique to tackle the dimensional redundancy issue and
innovatively introduces a dimensional mask to reduce the gradient effects of
specific dimensions containing the confounder, which is trained by employing a
meta-learning paradigm with the objective of improving the performance of
masked representations on a typical self-supervised task. We provide solid
theoretical analyses to prove MetaMask can obtain tighter risk bounds for
downstream classification compared to typical contrastive methods. Empirically,
our method achieves state-of-the-art performance on various benchmarks.
Comment: Accepted by NeurIPS 202
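
A minimal sketch of the dimensional-mask idea, assuming a learnable per-dimension gate on the representation; the meta-learning loop that actually trains the mask (inner self-supervised step, outer masked-performance step) is omitted, and all names are our own.

```python
# Sketch of a learnable per-dimension gate that down-weights confounding
# dimensions; MetaMask's meta-learning training procedure is omitted.
import torch
import torch.nn as nn

class DimensionalMask(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(dim))  # one gate per dimension

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z * torch.sigmoid(self.logits)  # shrink confounder dimensions

mask = DimensionalMask(256)
z = torch.randn(64, 256)   # contrastive representations
z_masked = mask(z)         # fed to the self-supervised objective
```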
Learning to Sample Tasks for Meta Learning
Through experiments on various meta-learning methods, task samplers, and
few-shot learning tasks, this paper arrives at three conclusions. Firstly,
there are no universal task sampling strategies to guarantee the performance of
meta-learning models. Secondly, task diversity can cause the models to either
underfit or overfit during training. Lastly, the generalization performance of
the models is influenced by task divergence, task entropy, and task
difficulty. In response to these findings, we propose a novel task sampler
called Adaptive Sampler (ASr). ASr is a plug-and-play task sampler that takes
task divergence, task entropy, and task difficulty into account when sampling
tasks. To optimize
ASr, we rethink and propose a simple and general meta-learning algorithm.
Finally, a large number of empirical experiments demonstrate the effectiveness
of the proposed ASr.
Comment: 10 pages, 7 tables, 3 figures
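
As an illustration of how the three quantities could drive sampling, the sketch below scores each candidate task with a weighted sum and samples from a softmax over the scores; the fixed weights are a stand-in for what ASr learns with its meta-learning algorithm.

```python
# Illustrative task sampler: score tasks by divergence, entropy, and
# difficulty, then sample from a softmax. Fixed weights stand in for the
# quantities ASr optimizes via meta-learning.
import torch

def sample_tasks(divergence, entropy, difficulty, w=(1.0, 1.0, 1.0), n=4):
    score = w[0] * divergence + w[1] * entropy + w[2] * difficulty
    probs = torch.softmax(score, dim=0)  # task-sampling distribution
    return torch.multinomial(probs, n, replacement=False)

# Usage with per-task statistics for a pool of 100 candidate tasks:
idx = sample_tasks(torch.rand(100), torch.rand(100), torch.rand(100))
```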
Modeling Multiple Views via Implicitly Preserving Global Consistency and Local Complementarity
While self-supervised learning techniques are often used to mine implicit
knowledge from unlabeled data by modeling multiple views, it is unclear how to
perform effective representation learning in a complex and inconsistent
context. To this end, we propose a methodology, the consistency and
complementarity network (CoCoNet), which leverages strict global inter-view
consistency and local cross-view complementarity-preserving regularization to
comprehensively learn representations from multiple views. On the global stage,
we reckon that the crucial knowledge is implicitly shared among views, and
enhancing the encoder to capture such knowledge from data can improve the
discriminability of the learned representations. Hence, preserving the global
consistency of multiple views ensures the acquisition of common knowledge.
CoCoNet aligns the probabilistic distribution of views by utilizing an
efficient discrepancy metric measurement based on the generalized sliced
Wasserstein distance. Lastly, on the local stage, we propose a heuristic
complementarity factor, which combines cross-view discriminative knowledge and
guides the encoders to learn not only view-wise discriminability but also
cross-view complementary information. Theoretically, we provide
information-theoretic analyses of the proposed CoCoNet. Empirically, to
investigate the improvement gains of our approach, we conduct extensive
experimental validations, which demonstrate that CoCoNet outperforms
state-of-the-art self-supervised methods by a significant margin, proving that
such implicit consistency- and complementarity-preserving regularization can
enhance the discriminability of latent representations.
Comment: Accepted by IEEE Transactions on Knowledge and Data Engineering
(TKDE) 2022; refer to https://ieeexplore.ieee.org/document/985763
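
The plain sliced Wasserstein distance, of which the paper's generalized version is an extension, admits a compact implementation: project both views onto random directions and match the sorted 1-D projections. The sketch below is illustrative, not CoCoNet's exact discrepancy measure.

```python
# Plain sliced Wasserstein distance between two views' embedding batches
# (equal batch sizes assumed); CoCoNet uses a generalized variant.
import torch

def sliced_wasserstein(x: torch.Tensor, y: torch.Tensor, n_proj: int = 64):
    theta = torch.randn(x.size(1), n_proj)
    theta = theta / theta.norm(dim=0, keepdim=True)  # random unit directions
    px, py = x @ theta, y @ theta                    # 1-D projections
    px, _ = px.sort(dim=0)                           # closed-form 1-D transport:
    py, _ = py.sort(dim=0)                           # match sorted samples
    return (px - py).pow(2).mean()

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)  # two views' embeddings
alignment = sliced_wasserstein(z1, z2)
```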